Data for Today
The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in North Carolina in 2004.
Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).
Draw a picture of how you would expect this dataset to look.
Relationships Between Variables
Visualizing Linear Regression
Characterizing Relationships
Form (e.g. linear, quadratic, non-linear)
Direction (e.g. positive, negative)
Strength (how much scatter/noise?)
Unusual observations (do points not fit the overall pattern?)
Your Turn!
How would you characterize this relationship?
What if you added another variable?
Summarizing a Linear Relationship
Correlation:
strength and direction of a linear relationship between two quantitative variables
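For reference, the usual sample (Pearson) correlation coefficient can be written as below. The notation is an assumption and is not taken from these slides: \(r\) is the correlation, \(\bar{x}\) and \(\bar{y}\) are the sample means, \(s_x\) and \(s_y\) are the sample standard deviations, and \(n\) is the number of cases.
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
Values of \(r\) lie between -1 and 1. The sign gives the direction and the magnitude gives the strength of the linear relationship.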
Anscombe Correlations
Four datasets, very different graphical presentations
For which of these relationships is correlation a reasonable summary measure?
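To make the point concrete, here is a minimal Python sketch (the slides themselves show R output) that computes the correlation for each of the four Anscombe datasets. The data values are the published Anscombe (1973) quartet; the helper name `pearson_r` is hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Anscombe's quartet: datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

correlations = {
    "dataset 1": pearson_r(x123, y1),
    "dataset 2": pearson_r(x123, y2),
    "dataset 3": pearson_r(x123, y3),
    "dataset 4": pearson_r(x4, y4),
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

All four correlations come out at roughly 0.82, even though the scatterplots look nothing alike. The number alone cannot tell you whether a linear summary is appropriate; the plot can.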
The Importance of Language
The word “correlation” has both a precise mathematical definition and a more general definition for typical usage in English.
These uses are related and generally in sync.
Even so, the two uses are sometimes conflated or misconstrued.
Linear Regression
Models are ubiquitous in Statistics!
We often assume that the value of our response variable is some function of our explanatory variable, plus some random noise.
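For a straight-line relationship, one common way to write this assumption in symbols is shown below. The notation is an assumption rather than something taken from these slides: \(\beta_0\) and \(\beta_1\) are the unknown population intercept and slope, and \(\varepsilon\) is the random noise.
\[ y = \beta_0 + \beta_1 \cdot x + \varepsilon \]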
Estimated / Fitted Regression Model
\[ \widehat{y} = b_0 + b_1 \cdot x \]
Why does this equation have a hat on y?
Coefficient Estimates
Our focus (for now…)
Estimated regression equation
\[\widehat{y} = b_0 + b_1 \cdot x\]
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept   -5.34      0.565     -9.45       0   -6.45    -4.23
2 weeks        0.325     0.015     22.2        0    0.296    0.354
Write out the estimated regression equation!
How do you interpret the intercept value of -5.341?
How do you interpret the slope value of 0.325?
Obtaining Residuals
\(\widehat{weight} = -5.341 + 0.325 \cdot weeks\)
What would the residual be for a pregnancy that lasted 39 weeks and whose baby weighed 7.63 pounds?
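A residual is the observed value minus the fitted value. The arithmetic can be sketched in Python as follows (the slides show R output, so this is only an illustration). The function names `fitted_weight` and `residual` are hypothetical; the coefficients are the ones in the equation above.

```python
def fitted_weight(weeks):
    """Predicted birth weight in pounds from the fitted line on the slide."""
    return -5.341 + 0.325 * weeks

def residual(observed_weight, weeks):
    """Residual = observed value minus fitted value."""
    return observed_weight - fitted_weight(weeks)

# The case from the question: 39 weeks of gestation, 7.63 pounds observed.
fitted = fitted_weight(39)
resid = residual(7.63, 39)
print(f"fitted: {fitted:.3f} lb, residual: {resid:.3f} lb")
```

A positive residual means the baby weighed more than the line predicts for that length of pregnancy; a negative residual means the baby weighed less.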
A different explanatory variable
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    6.74      0.102     65.8        0    6.54     6.94
2 gained       0.015     0.003      4.79       0    0.009    0.021
Write out this estimated regression equation!
How would you choose which model is better?!
Categorical Explanatory Variables
Indicator Variables
\(x\) is a categorical variable with levels:
"nonsmoker"
"smoker"
Need:
\(1_{smoker}(x) = 1\) if the mother was a "smoker"
\(1_{smoker}(x) = 0\) if the mother was a "nonsmoker"
A different equation
# A tibble: 2 × 7
  term          estimate std_error statistic p_value lower_ci upper_ci
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept         7.23     0.047    155.     0        7.14     7.32
2 habit: smoker    -0.4      0.13      -3.07   0.002   -0.656   -0.145
What is the estimated mean birth weight for nonsmoking mothers?
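One way to read the table above as an equation, using the indicator variable defined earlier (a sketch based only on the printed, rounded estimates):
\[ \widehat{weight} = 7.23 - 0.4 \cdot 1_{smoker}(x) \]
For a nonsmoking mother the indicator is 0, so the estimated mean birth weight is the intercept, 7.23 pounds. For a smoking mother the indicator is 1, giving \(7.23 - 0.4 = 6.83\) pounds.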